Forth Decompiler
Volume Number: 1
Issue Number: 2
Column Tag: Forth Forum
“A Forth Decompiler”
By Joerg Langowski
“A Forth Decompiler”
Everyone of you Mac Forth users is familiar enough with Forth to know that it is
a ‘threaded interpretive’ language. A Forth definition (as you type it into your
machine) consists of a string of other, previously defined Forth words, and is compiled
as a string of addresses that point to the definitions of these other words.
This makes for a rather fast interpretation of the resulting code. However, some
of the very primitive words and those words whose execution is time-critical may also
be defined in machine language. MacForth has devised a very elegant way to distinguish
between Forth words defined from within Forth and machine code.
Structure of a Forth Definition
The object code of any Forth word, as it is compiled into the object dictionary,
starts with at least one 16-bit word that is a meaningful executable 68000 machine
language instruction. When the Forth word is executed, the interpreter simply jumps
to this address. Forth definitions (colon definitions, constants, variables) now start
with one of the 68000 TRAP instructions ($4E4X, where X can be anything from $0 to
$F). The corresponding trap vector points to a routine which e.g. in the case of a colon
definition gets the next 16-bit word and interprets it as a Forth token (converts it to
an address and executes it), or - in the case of a variable - puts the address of the
variable on the stack.
If the word is defined completely in machine language, the code is executed until a
special JMP instruction transfers control to the next higher level (I’ll describe that
later).
At this point I have to confess that I would not have come even this far if it had not
been for two excellent routines that I found on a CALL-A.P.P.L.E. public domain disk.
One of those - a Forth decompiler - is included below so that you can enjoy hacking
into the Forth engine, the other one - a disassembler - was too long to be printed here.
An Example
Admittedly, part of the above sounds a bit dry and theoretical. Lets look at a
simple example.
Assume you had defined the word TEST as follows:
: TEST DUP 2* SWAP DROP . ;
The Forth compiler will then create a list of 16-bit words that looks like:
$4E4F (trap for colon definition)
$0498 (Forth token for DUP )
$074E ( “ “ “ 2* )
$049C ( “ “ “ SWAP)
$00EC ( “ “ “ DROP )
$0EBE ( “ “ “ . )
$0060 ( “ “ “ EXIT )
Interpretation always ends at the EXIT token.
TOKENS
What are those ‘tokens’? They are the starting addresses of the Forth definitions
that are offset by a constant that is contained in register A4 (probably to make the
object code relocatable). There is a word in MacForth that converts a token to an
address, TOKEN>ADDR. The token of a word is extracted from the vocabulary by the
Forth word, FIND. Therefore, you will get the starting address of the example above by
executing
FIND TEST TOKEN>ADDR .
The address that you’ll see displayed, of course, depends on how much object code
you have already in your system. Let’s call this number TESTADDR. Then define the
following word:
: TEST.DISP 7 0 DO I 2* TESTADDR ( insert your # here) + W@ .
LOOP ;
and execute TEST.DISP; you will see the list of words above.
This way you can decompile any Forth word that you find in the system. The
decompiler is somewhat more convenient, of course; if you use the procedure above,
you still have to convert the tokens into Forth words. This is done (for one token on the
stack) by executing
NFA ID.
This converts the token (not the pfa, as the Forth 1.1 manual says) into the name
field address (NFA) and then displays the name of the word (ID.).
Machine language definitions
What if the definition is direct machine language code? Again, let us look at an
example, the word SWAP. FIND SWAP TOKEN>ADDR gives (in my system) $5B60. At
this address, however, we find code that does not start with a trap statement; it is a
routine that does what we expect:
202F 0004 MOVE.L 4(A7),D0 /move item below top -> D0
2F57 0004 MOVE.L (A7),4(A7) /move top item one down
2E80 MOVE.L D0,(A7) /move D0 -> top of stack
4ED4 JMP (A4)
/get next token
We see that indeed the two top stack items are exchanged. The last statement is the
end of any machine language Forth definition. This jump to the address in A4 is what I
briefly mentioned above. A4 contains the address of a routine that gets and executes the
next Forth token from the object code (which A3 points to):
MOVE (A3)+,D0
/next object token -> D0
BMI L1
/is it neg. get address from token table
JMP (A4,D0.W)
/jump to start of definition (token + A4)
L1 MOVE(A4,D0.W),D1
/get address from table
JMP (A4,D1.W)
/and jump to start of definition
Hidden definitions
When you decompile (with the program below) the word SELECT.WINDOW, you’ll
see something funny. It seems to be a regular Forth colon definition; however, the
tokens displayed seem to have no name. Only ??? and the token numbers are displayed.
These are tokens whose names have been deleted from the vocabulary, but their
corresponding addresses (A4+token) point to valid definitions. The reason why CSI did
this is probably to keep the vocabulary short and to make words inaccessible to users
whose misuse could have a disasterous effect on the system. Anyway, the word
SUBLEVEL in the definitions below will decompile and display any such ‘hidden’ code, if
it is a colon definition. It will display nothing for machine code definitions, you have to
disassemble them.
SELECT.WINDOW, with this tool, then becomes very clear. Its first level
definition looks like:
: SELECT.WINDOW {2142} {1B0E} ;
where the braces indicate those ‘no-name’ tokens. {2142} merely checks if the
pointer on the stack is a valid window pointer, to keep the toolbox routine from
crashing; decompile it with SUBLEVEL to see what it does exactly. {1B0E} is a 2-word
machine code routine:
$A91F
/toolbox trap for SELECT.WINDOW
JMP (A4)
/and get next word. That’s all!
Listing 1: Forth decompiler
( DECOMPILER Blocks File -- Version 1.00 ) ( ADG - modif.
110384 jl )
DECOMP ( -- )
Decompiles the definition of the next word in the input stream. A
line is displayed for each word in the definition. Each line begins
with its relative code address in hex. Next is the name of the word.
Finally, if the word has an in-line parameter, it is shown. If the
word is a branching word, the value is the target address. If the
parameter is a token, its name is shown. If it is a string, the
string is shown in double-quotes. If it is a word or double-word,
its hex value is followed by its decimal value.
#DECOMP in block 8 can also be loaded by those who wish to write a
routine to pass tokens on the stack to be decompiled. For valid
tokens, its output is identical to that of DECOMP .
Written: 07/21/84 By: Alan D. Galumbeck [70220,200]
NO RIGHTS RESERVED NO RIGHTS RESERVED NO RIGHTS RESERVED
BASE @ DECIMAL VARIABLE HIGH.PFA 16384 MINIMUM.OBJECT 2048
MINIMUM.VOCAB : .DIGITS ( n1\n2 -- | Types the low-order n2
digits of n1 ) 0 <# DO # LOOP #> TYPE ; : SPACE.TO ( n --
| Spaces to column n or 2 spaces if past n ) COL @ - 2 MAX SPACES ;
: DISP.WORD ( pfa -- pfa+2 | Display a 16-bit parameter)
DUP W@ 4 .DIGITS 31 SPACE.TO DECIMAL DUP
NFA ?DUP IF 42 SPACE.TO ID. THEN 2+ ;
: DISP.DBL ( pfa -- pfa+4 | Display a 32-bit parameter )
DUP @ DUP 8 .DIGITS 31 SPACE.TO DECIMAL . HEX 4+ ;
: DISP.STRING ( pfa -- pfa+len | Display a string parameter )
34 EMIT COUNT 2DUP TYPE 34 EMIT + =CELLS ;
: DISP.TARGET ( base.pfa\pfa -- base.pfa\pfa+2 )
( Display a branch target and save if it’s the highest )
DUP IF HIGH.PFA ! ELSE DROP THEN
2DUP SWAP - OVER
: DISP.TOKEN ( pfa -- pfa+2 | Display a token parameter)
DUP W@ NFA ?DUP IF ID. ELSE DUP W@ 4 .DIGITS THEN 2+ ;
: DISP.ADDR ( pfa -- pfa+4 | Display an address parameter )
DUP @ NFA ?DUP IF ID. ELSE DUP @ NEXT.PTR + 8 .DIGITS THEN 4+ ;
: SPECIAL.TOKENS ( base.pfa\pfa\token -- [base.pfa\next.pfa] or
[base.pfa\next.pfa\0] )
( Handle in-line parameters and terminating words )
CASE TOKEN.FOR EXIT OF 0 ENDOF
TOKEN.FOR (;CODE@) OF DISP.TOKEN 0 ENDOF
TOKEN.FOR COMPILE OF DISP.TOKEN ENDOF
TOKEN.FOR 0BRANCH OF DISP.TARGET ENDOF
TOKEN.FOR BRANCH OF DISP.TARGET ENDOF
TOKEN.FOR (OF) OF DISP.TARGET ENDOF
TOKEN.FOR (LOOP) OF DISP.TARGET ENDOF
TOKEN.FOR (+LOOP) OF DISP.TARGET ENDOF
TOKEN.FOR (MENU.SELECTION:) OF DISP.TARGET ENDOF
TOKEN.FOR ALIT OF DISP.ADDR ENDOF
TOKEN.FOR WLIT OF DISP.WORD ENDOF
TOKEN.FOR LIT OF DISP.DBL ENDOF
TOKEN.FOR (.”) OF DISP.STRING ENDOF
TOKEN.FOR ($LIT) OF DISP.STRING ENDOF
TOKEN.FOR (ERROR”) OF DISP.STRING ENDOF
TOKEN.FOR (ABORT”) OF DISP.STRING ENDOF
TOKEN.FOR $ADDR OF DISP.STRING ENDOF
( Insert the ones I’ve missed here. )
0 OF 2 - DISP.TOKEN ENDOF
ENDCASE ;
: DECODE.TOKENS ( pfa -- | Display the words starting at pfa )
DUP HIGH.PFA ! DUP
BEGIN
HEX 2DUP SWAP - CR 4 .R 2 SPACES DUP 2+ SWAP W@ DUP NFA ?DUP
IF ID. ELSE .” ???” drop 0 THEN
20 SPACE.TO SPECIAL.TOKENS ?DUP
IF FALSE ELSE DUP HIGH.PFA @ > THEN
UNTIL
2DROP ;
: .VALUE ( n1\n2 -- | Display constants and UA variables )
HEX .DIGITS .” hex “ DECIMAL . .” decimal )” ;

: DECODE.VECTOR ( pfa\vector -- | Display definition type )
CASE
11 OF .” User Area variable ( Offset = “ W@ DUP 4 .VALUE ENDOF
12 OF .” 16 bit constant ( Value = “
13 OF .” 32 bit constant ( Value = “ @ DUP 8 .VALUE ENDOF
14 OF .” Variable, array, or string” DROP ENDOF
15 OF .” Colon definition” DECODE.TOKENS ENDOF
.” Unknown code type ( Vector = “ 2 .VALUE .” )”
ENDCASE ;

: CHK.CODE.TYPE ( token -- [pfa\vector\true] or [false] |
Returns false for machine code definitions, true for others )
TOKEN>ADDR DUP 2+ SWAP W@ DUP 16/ 1252 =
IF 15 AND TRUE ELSE 2DROP FALSE THEN ;
( Note: 1252 is the machine code for a 68000 TRAP instruction
divided by 16. Vector is the low-order four bits of
the TRAP instruction. )
: sublevel chk.code.type if drop decode.tokens then ;

: DECOMP ( -- | Decompile the next word in the input stream )
GET.LINE.HEIGHT GET.TEXTSIZE BASE @ 9 TEXTSIZE 10 LINE.HEIGHT
+FIND CR POCKET COUNT TYPE .” -- “
IF
IF .” IMMEDIATE “ THEN
CHK.CODE.TYPE IF DECODE.VECTOR ELSE .” Machine code
definition” THEN
ELSE .” Not in dictionary” THEN
BASE ! TEXTSIZE LINE.HEIGHT CR ;

: #DECOMP ( token -- | Decompile word whose token is supplied )
BASE @ GET.LINE.HEIGHT GET.TEXTSIZE 4 PICK DUP
9 TEXTSIZE 10 LINE.HEIGHT NFA ?DUP CR
IF DUP ID. .” -- “ C@ 128 AND
IF .” IMMEDIATE “ THEN
CHK.CODE.TYPE IF DECODE.VECTOR ELSE .” Machine code
definition” THEN
ELSE HEX 4 .DIGITS .” -- Not a valid token” THEN
TEXTSIZE LINE.HEIGHT BASE ! DROP CR ;